Predicting the risk of hypertension using machine learning algorithms: A cross sectional study in Ethiopia

Background and objectives Hypertension (HTN), a major global health concern, is a leading cause of cardiovascular disease, premature death, and disability worldwide. It is important to develop an automated system to diagnose HTN at an early stage. Therefore, this study devised a machine learning (ML) system for predicting patients at risk of developing HTN in Ethiopia. Materials and methods The HTN data were collected in Ethiopia and included 612 respondents with 27 factors. We employed the Boruta-based feature selection method to identify the important risk factors of HTN. Four well-known models [logistic regression, artificial neural network, random forest, and extreme gradient boosting (XGB)] were developed on the training set to predict HTN patients using the selected risk factors. The performance of the models was evaluated by accuracy, precision, recall, F1-score, and area under the curve (AUC) on the testing set. Additionally, the SHapley Additive exPlanations (SHAP) method, one of the explainable artificial intelligence (XAI) methods, was used to investigate the associated predictive risk factors of HTN. Results The overall prevalence of HTN was 21.2%. The XGB-based model was the most appropriate model for predicting patients at risk of HTN, achieving an accuracy of 88.81%, precision of 89.62%, recall of 97.04%, F1-score of 93.18%, and AUC of 0.894. The XGB with SHAP analysis revealed that age, weight, fat, income, body mass index, diabetes mellitus, salt, history of HTN, drinking, and smoking were the associated risk factors for developing HTN. Conclusions The proposed framework provides an effective tool for accurately predicting individuals in Ethiopia who are at risk of developing HTN at an early stage and may help with early prevention and individualized treatment.

Introduction Hypertension (HTN), defined as blood pressure elevated beyond its normal range, is a major public health concern whose prevalence and effects among adults have been rising over time worldwide [1][2][3]. It is one of the most common serious chronic non-communicable diseases. Hypertensive people are affected by different types of cardiovascular diseases (CVDs), e.g., coronary heart disease, stroke, peripheral arterial disease, aortic disease, and myocardial infarction [4][5][6][7], which are the leading causes of disability, morbidity, and mortality and which increase the economic burden of out-of-pocket expenditures (OOPE) [8][9][10]. As reported by the World Health Organization (WHO), around 9.4 million people worldwide die due to HTN every year [10]. According to Belay et al. (2022), the global prevalence of HTN was 26% in 2000 and was projected to reach around 1.56 billion people (29.2%) by 2025 [11]. The latest WHO estimate, in 2021, revealed that about one-third (31.1%) of the world's adult population had HTN (1.39 billion), of whom two-thirds were from low- and middle-income countries (LMICs) [12]. Also, a systematic analysis of population-based studies from 90 countries, including Ethiopia, estimated that HTN among adults was more prevalent in LMICs (31.5%) than in high-income countries (28.5%) [13]. Different epidemiological studies in Ethiopia reported that the prevalence of HTN ranged from 7.7% to 41.9% [14]. Moreover, HTN is disproportionately prevalent and is increasing alarmingly in resource-poor countries such as Ethiopia [11]. However, the risk of HTN might be mitigated and controlled if HTN patients and their interpretable risk factors are identified at an early stage. Thus, early detection of HTN patients, together with identification of interpretable risk factors, plays a key role in giving patients timely prevention and intervention.
It is therefore essential to detect HTN and identify its interpretable risk factors at an early stage.
Many empirical studies have determined several risk factors associated with HTN in LMICs, including Ethiopia [15][16][17][18][19][20][21]. Nevertheless, existing association studies had several limitations. Most importantly, previous studies relied on traditional linear models, such as logistic regression (LR) and the Cox proportional hazard model, to identify the risk factors significantly associated with HTN [22][23][24]. Moreover, real data with high-dimensional non-linear patterns present a challenge to traditional linear models, and the low precision of linear models impedes patient-level use. To overcome those limitations with complex real data, machine learning (ML), which is widely used in current public health research, might be a suitable choice. ML is a subset of artificial intelligence (AI) in which the algorithms that execute the prediction process collect the necessary information from previous experience and/or detect patterns in data to accomplish a task, typically classification or identification [25][26][27][28]. It can provide several advantages, including process automation, reliable probabilistic estimation for uncovering hidden patterns or relationships with high accuracy, lower labor costs and time when handling large amounts of data for decision-making or inference, and model interpretability [29][30][31]. There are different types of learning algorithms in ML; among them, supervised learning is the most popular and widely applicable. The goal of a supervised learning algorithm is to use the dataset to build a model that can predict the system's output given new inputs. The two major types of supervised learning are regression and classification. Examples of regression include linear regression and logistic regression [32]. 
Examples of classification include ensemble methods, decision trees (DT), k-nearest neighbors (kNN), support vector machine (SVM), Naïve Bayes (NB), artificial neural network (ANN), and so on [32,33]. The ensemble method is a machine learning technique that combines multiple models with the same learning algorithm to achieve better predictive performance [34]. Ensemble methods include eXtreme gradient boosting (XGB), AdaBoost, histogram-based gradient boosting classification tree, and random forest (RF) [25]. Previously, some researchers developed multivariable prediction models using several ML and explainable artificial intelligence methods [35][36][37]. Most of the existing risk prediction models were developed with a limited number of risk factors and provided lower accuracy for predicting HTN patients [35,38]. Although DT and ensemble approaches have attracted great attention in recent years for identifying individuals at risk of HTN, there is no evidence that these algorithms have been successfully applied in Ethiopian clinical settings.
To the best of our knowledge, this is the first study to build a predictive model using ML algorithms for predicting the individual risk of HTN in Ethiopia. Thus, the objective of this study was to develop an efficient ensemble-based explainable ML framework for predicting patients at risk of HTN in Ethiopia.
Furthermore, we employed under-sampling and the adaptive synthetic (ADASYN) class balancing strategy to enhance the confidence score of the developed prediction models. For model interpretation, we identified the key risk factors of HTN and the direction of the relationship between the risk factors and HTN using SHapley Additive exPlanations (SHAP), a post hoc model interpretation technique that is theoretically based on the Shapley value. The overall pipeline of the explainable machine learning based framework is displayed in Fig 1. The layout of this paper is as follows: Materials and methods covers the data source, statistical analysis, feature selection, machine learning algorithms, performance evaluation criteria, and model interpretability. The results are presented in section 3 and discussed in section 4. Finally, the conclusion is presented in section 5.

Data source
The community-based cross-sectional data used in this investigation were collected in 2017 by the Hawassa city administration and made available to the public by Paulose et al. [39]. The data were collected through multistage random sampling and comprised a total of 633 respondents, ranging in age from 31 to 90, who had resided in the city for at least six months. The sample size was determined using a sample size determination formula that considered a design effect of 1.5, a 95% confidence interval, a 5% margin of error, 80% power, a proportion of 50% (to maximize the sample size), and a 10% non-response rate [39]. Different levels of explanatory variables were included as individual risk factors of HTN, and the quantitative variables were categorized based on previous settings [18][19][20][39]. A brief explanation of the included risk factors is presented in Table 1. In this study, a patient with HTN was determined based on the WHO cutoff (systolic pressure ≥140 mmHg and/or diastolic pressure ≥90 mmHg and/or being on medication for HTN at the time of data collection) [40]. Finally, a total of 612 respondents were included in this study after eliminating all records with missing values.
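As a worked illustration of how the stated inputs combine, one common single-proportion formula with a design effect and a non-response allowance is shown below. This is an assumption about the form of the calculation, not a formula quoted from [39]; the symbols $n_0$, $Z$, $d$, $\mathrm{DEFF}$, and $r$ are introduced here for illustration only, and the 80% power does not enter this particular formula.

```latex
n_0 = \frac{Z_{1-\alpha/2}^{2}\, p\,(1-p)}{d^{2}}
    = \frac{1.96^{2} \times 0.5 \times 0.5}{0.05^{2}} \approx 384.2,
\qquad
n = n_0 \times \mathrm{DEFF} \times (1 + r)
  = 384.2 \times 1.5 \times 1.10 \approx 633.9
```

Under these assumptions the result is in line with the 633 respondents reported above.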

Statistical analysis
The baseline and demographic characteristics of the patients were presented as percentages (%) for categorical data and as mean ± SD (standard deviation) for continuous data. The Pearson χ² test was employed to determine the association between categorical risk factors and HTN, whereas for normally distributed continuous risk factors, the independent-samples t-test was used to examine the mean difference between the groups (HTN vs. non-HTN). All tests were two-sided, and a p-value of <0.05 was considered statistically significant.
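To make the categorical test concrete, the sketch below computes a Pearson χ² statistic and its p-value for a 2×2 table using only the Python standard library. The counts and the function name `chi_square_2x2` are hypothetical, no continuity correction is applied, and the study's own analysis software is not assumed here.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: rows = smoker / non-smoker, columns = HTN / non-HTN.
example_table = [[10, 20], [30, 40]]
chi2_stat, chi2_p = chi_square_2x2(example_table)
print(round(chi2_stat, 4), chi2_p < 0.05)
```

With these illustrative counts the p-value is above 0.05, so the association would not be considered statistically significant under the threshold stated above.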

Feature selection
Feature selection (FS), or risk factor identification, is also known as variable selection or subset selection in statistics and ML. It is a method for selecting the relevant features by removing the irrelevant or redundant features from the dataset. In this study, the Boruta-based feature selection method (FSM) was adopted to identify the relevant features. Boruta is a wrapper-based feature selection method that employs the random forest classifier algorithm. This method has a wide range of applications and has been reported to perform better than others because it is unbiased and stable [41].

Machine learning algorithms
This study used four supervised ML algorithms for predicting patients at risk of HTN (Table 2).

Logistic regression
Logistic regression (LR) is one of the most popular supervised ML algorithms and is mainly used for classification tasks [42]. It leverages the idea of probability: the LR model employs the logistic function to estimate the probability of the response variable (HTN or non-HTN) in terms of one or more input features. The logistic function can be represented as follows

$$p_j = \frac{\exp\left(\beta_0 + \sum_{k} \beta_k X_{kj}\right)}{1 + \exp\left(\beta_0 + \sum_{k} \beta_k X_{kj}\right)} \tag{1}$$

where $p_j$ denotes the probability of HTN and $(1 - p_j)$ denotes the probability of non-HTN for the $j$-th individual; $X_{kj}$ is the $k$-th input feature of the $j$-th individual; $\beta_k$ is the $k$-th regression coefficient; and $\beta_0$ is the intercept.
The above Eq (1) can be expressed as

$$\log\left(\frac{p_j}{1 - p_j}\right) = \beta_0 + \sum_{k} \beta_k X_{kj}$$

and the odds as

$$\frac{p_j}{1 - p_j} = \exp\left(\beta_0 + \sum_{k} \beta_k X_{kj}\right)$$

If $\frac{p_j}{1 - p_j} > 1$, the individual is classified as HTN, whereas if $\frac{p_j}{1 - p_j} < 1$, the individual is classified as non-HTN.
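The decision rule described above can be sketched in a few lines of Python. The coefficient values, the two features, and the function names `htn_probability` and `classify` are hypothetical and chosen only for illustration; they are not the fitted model from this study.

```python
import math

def htn_probability(features, intercept, coefficients):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum_k b_k * x_k)))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-linear))

def classify(features, intercept, coefficients):
    """Label as HTN when the odds p / (1 - p) exceed 1, i.e. when p > 0.5."""
    p = htn_probability(features, intercept, coefficients)
    odds = p / (1.0 - p)
    return "HTN" if odds > 1.0 else "non-HTN"

# Hypothetical coefficients for two standardized features (age, BMI).
intercept = -1.0
coefficients = [0.8, 0.5]
print(classify([2.0, 1.0], intercept, coefficients))   # linear predictor = 1.1
print(classify([-1.0, 0.0], intercept, coefficients))  # linear predictor = -1.8
```

Because the odds exceed 1 exactly when the linear predictor is positive, the rule is equivalent to thresholding the predicted probability at 0.5.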

Artificial neural network
Artificial neural network (ANN) is a non-linear modeling algorithm inspired by the structure and function of the human brain. It consists of interconnected processing nodes organized into three types of layers: input, hidden, and output. The input layer is connected to the hidden layer through updatable weights, and the hidden layer is connected to the output layer. In this method, $X = (x_1, \ldots, x_k)$ is used as the input vector of the back propagation (BP) algorithm for learning and mapping the relationship between the input features and the outcome variable. The BP algorithm propagates the error between the predicted and observed outcome backward through the network, adjusting the weights of the hidden layers, with a non-linear sigmoid activation function [43]. The sigmoid activation function is defined as

$$f(x) = \frac{1}{1 + e^{-x}}$$

This procedure is repeated iteratively until the weights no longer change between iterations or the minimum error is reached.

Random forest
Random forest (RF) is a popular machine learning algorithm that was developed by Leo Breiman and is widely used in classification and regression problems [44]. It is based on the concept of ensemble learning and trains multiple decision trees on random subsets of the data. The RF-based model is constructed using the following steps:
Step 1: From the given training dataset ($X_{ij}$, $i = 1, 2, \ldots, k$; $j = 1, 2, \ldots, n$), randomly select risk factors using the bootstrap sampling procedure to create a new subset.
Step 2: Build a decision tree (DT) on the new subset.
Step 3: Repeat Step 1 and Step 2 until many trees have been constructed, which together form a forest.
Step 4: Collect the prediction from each DT and select the final prediction by majority voting.
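The four steps above can be sketched with the Python standard library. This is a simplified illustration, not the model used in the study: each "tree" is a one-split stump on a single hypothetical feature, so the per-split random feature selection of a full random forest is omitted and the sketch is closer to plain bagging. All names and data are illustrative.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Step 1: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Step 2 stand-in: a one-split 'tree' on a single feature (not a full decision tree)."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # the bootstrap sample held a single distinct feature value
        majority = Counter(y for _, y in rows).most_common(1)[0][0]
        return lambda x: majority
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def fit_forest(rows, n_trees, seed=0):
    """Step 3: repeat steps 1 and 2 to build many trees."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]

def predict(forest, x):
    """Step 4: majority vote over the individual tree predictions."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical data: (age, label) with 1 = HTN, 0 = non-HTN.
data = [(32, 0), (35, 0), (41, 0), (44, 0), (58, 1), (63, 1), (70, 1), (75, 1)]
forest = fit_forest(data, n_trees=25)
print(predict(forest, 30), predict(forest, 80))
```

An odd number of trees is used so that a two-class vote cannot end in a tie.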

Extreme gradient boosting
Extreme gradient boosting (XGB) is an efficient ensemble-based machine learning algorithm that combines decision trees with the gradient boosting algorithm. It is highly adaptable and works well in most classification problems, including HTN prediction [45]. Boosting is a learning algorithm that attempts to create a strong classifier from weak learners or classifiers. The terms weak and strong refer to how closely the predicted class correlates with the actual class. By adding classifiers on top of each other iteratively, each new classifier can correct the errors of the earlier ones. This procedure is repeated until the model accurately predicts the class label of the target variable on the training dataset.
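The idea of stacking weak learners so that each corrects the errors of the previous ones can be illustrated with a minimal gradient boosting loop. The sketch below is plain gradient boosting with squared loss and one-split stumps, not XGBoost itself (it has no regularization, second-order terms, or tree pruning); the data, learning rate, and function names are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Weak learner: one threshold split that minimizes squared error on the residuals."""
    best = None
    # Skip the largest value so that the right-hand side of the split is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Additive model: start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of the model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical 1-D data: a risk score that rises with age.
ages = [30, 35, 40, 45, 50, 55, 60, 65]
risk = [0.0, 0.0, 0.1, 0.2, 0.6, 0.7, 0.9, 1.0]
weak = boost(ages, risk, n_rounds=1)
strong = boost(ages, risk, n_rounds=30)
print(mse(weak, ages, risk) > mse(strong, ages, risk))
```

The comparison at the end shows the training error shrinking as more weak learners are added, which is the behavior the paragraph above describes.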

Data partition and balancing
We randomly divided the whole dataset into a 70% training set [HTN: 91 (21.2%), non-HTN: 338 (78.8%)] and a 30% testing set [HTN: 39 (21.3%), non-HTN: 144 (78.7%)] using a stratified sampling procedure [46]. The class labels of the data were imbalanced, i.e., the class distribution of observations was skewed. Class imbalance biases the results of a classification task toward the majority class of the response variable [47,48]. Several data balancing strategies are widely used to deal with this problem. Among them, under-sampling and the adaptive synthetic (ADASYN) balancing strategy were applied to the training set to balance the data. ADASYN is a more recent generalization of the synthetic minority oversampling technique (SMOTE) and generates new samples for the minority class using a weighted distribution [49].
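A stratified split followed by random under-sampling can be sketched as follows, using only the standard library. The class totals (130 HTN and 482 non-HTN) follow from the counts given above; the rounding rule, seeds, and function names are assumptions, so the split sizes printed here need not match the reported 429/183 exactly. ADASYN is not reproduced in this sketch.

```python
import random

def stratified_split(labels, train_fraction, seed=0):
    """Return (train_idx, test_idx) with the class proportions kept in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return train_idx, test_idx

def random_under_sample(indices, labels, seed=0):
    """Drop majority-class rows at random until every class has the minority-class count."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, target))
    return balanced

# Class totals implied by the text: 130 HTN (1) and 482 non-HTN (0), 612 in total.
labels = [1] * 130 + [0] * 482
train_idx, test_idx = stratified_split(labels, train_fraction=0.7)
balanced_idx = random_under_sample(train_idx, labels)
print(len(train_idx), len(test_idx), len(balanced_idx))
```

Balancing is applied to the training indices only, so the testing set keeps the original class distribution.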

Cross validation and tune hyperparameters
The four ML algorithms mentioned above (LR, ANN, RF, and XGB) have additional parameters, called hyperparameters. Hyperparameters are parameters that the user explicitly defines before the learning process to improve model performance. The grid search method with a repeated 10-fold (K10) cross-validation protocol was used to tune the hyperparameter values on the training set. Under the K10 protocol, the training dataset is divided in a 9:1 ratio into a training subset and a validation set in each fold. The caret package (version 6.0-93) in R was used to obtain the optimal hyperparameter values for the four models, which are displayed in Table 3.
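The tuning protocol can be illustrated with a small standard-library sketch of grid search over repeated 10-fold cross-validation. It does not reproduce caret; the toy "model" (a fixed age threshold with nothing to fit), the data, the number of repeats, and all function names are hypothetical stand-ins.

```python
import random
from itertools import product

def repeated_k_fold(n_samples, k=10, repeats=3, seed=0):
    """Yield (train_idx, val_idx) pairs; each repeat reshuffles and splits into k folds."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for held_out in range(k):
            val_idx = folds[held_out]
            train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield train_idx, val_idx

def grid_search(grid, score_fn, n_samples, k=10, repeats=3):
    """Return the parameter combination with the best mean validation score."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va) for tr, va in repeated_k_fold(n_samples, k, repeats)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy stand-in for a model: predict HTN when age exceeds a tunable threshold.
ages = [31, 36, 40, 44, 47, 52, 55, 59, 63, 66, 70, 74, 78, 81, 85, 38, 49, 57, 68, 90]
htn = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1]

def validation_accuracy(params, train_idx, val_idx):
    # A real model would be fitted on train_idx here; this toy rule has nothing to fit.
    hits = sum((ages[i] > params["threshold"]) == bool(htn[i]) for i in val_idx)
    return hits / len(val_idx)

best_params, best_score = grid_search({"threshold": [40, 53, 65]}, validation_accuracy, len(ages))
print(best_params, best_score)
```

Every candidate is scored on the same folds because the splitter is reseeded identically for each grid point, which keeps the comparison between hyperparameter values fair.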

Performance evaluation criteria
The performance of the four selected ML models was evaluated by five popular evaluation criteria: accuracy, precision, recall, F1-score, and area under the curve (AUC). The values of the evaluation criteria were calculated from the confusion matrix using four measures (Table 4): true positive ($t_p$): the model predicted HTN where the actual group was HTN; true negative ($t_n$): the model predicted non-HTN where the actual group was non-HTN; false positive ($f_p$): the model predicted HTN where the actual group was non-HTN; and false negative ($f_n$): the model predicted non-HTN where the actual group was HTN.

Accuracy. It is used to assess the overall correctness of the models. It is defined as the ratio of the sum of true cases ($t_p$ and $t_n$) to the total number of cases. Accuracy is defined mathematically as

$$\text{Accuracy} = \frac{t_p + t_n}{t_p + t_n + f_p + f_n}$$

Precision. It is the ratio of $t_p$ cases to the predicted positive (HTN) cases. It is also called positive predictive value and is used to assess the reliability of the model's positive predictions. Precision is defined mathematically as

$$\text{Precision} = \frac{t_p}{t_p + f_p}$$

Recall. It is the ratio of $t_p$ cases to the actual positive (HTN) cases. A model with high recall has low $f_n$. It is also called sensitivity or true positive rate (TPR). Recall is defined mathematically as

$$\text{Recall} = \frac{t_p}{t_p + f_n}$$

F1-score. It is the harmonic mean of precision and recall. The F1-score is defined mathematically as

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
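The four confusion-matrix criteria translate directly into code. In the sketch below the counts are hypothetical and are not taken from the study's confusion matrix; the function name `classification_metrics` is likewise illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, not taken from the study's confusion matrix.
metrics = classification_metrics(tp=30, fp=10, fn=5, tn=55)
print({name: round(value, 4) for name, value in metrics.items()})
```

On imbalanced data such as this, accuracy alone can look favorable even when the minority class is poorly detected, which is why precision, recall, and F1-score are reported alongside it.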

Area under the curve
The AUC is defined as the integral of the receiver operating characteristic (ROC) function over the given range and is used to assess the quality of the built predictive model. The mathematical formula of AUC is as follows

$$\text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d(\text{FPR})$$

A ROC curve is a plot of TPR (sensitivity) on the y axis against the false positive rate (FPR), or 1 − specificity, on the x axis for different cutoff values. The ROC curve is broadly used in medical diagnosis, with the AUC serving as a single-number measure for evaluating the predictive validity of an ML-based model [50]. The AUC takes values from 0 to 1.
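The area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which gives a compact way to compute it. The sketch below uses that pairwise formulation; the labels, scores, and the function name `roc_auc` are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = HTN) and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc_value = roc_auc(y_true, y_score)
print(auc_value)
```

Here three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75; a value of 0.5 corresponds to random ranking and 1 to perfect separation.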

Model interpretability
Shapley additive explanations (SHAP) is an interpretability and visualization approach constructed on Shapley values. The method was introduced by Lundberg and Lee (2017) and is widely used to explain local and global importance by computing the contribution of each risk factor to the ML-based prediction model [51]. SHAP originates from coalitional game theory, where each predictor is treated as an individual player in a game or coalition. The SHAP framework offers a fair allocation of the model outcome among the players and satisfies a series of desirable properties/axioms, including consistency, efficiency, dummy, and additivity [52]. The efficiency property of the SHAP method provides more reliable results than other methods, for example local interpretable model-agnostic explanations [53]. Risk factors contribute to the model's prediction with different magnitude and sign, both of which are captured by Shapley values. Accordingly, Shapley values represent estimates of the magnitude of each feature's contribution and its direction (sign). Risk factors with positive SHAP values push the prediction toward HTN, whereas risk factors with negative SHAP values push the prediction toward the control group (non-HTN). In particular, the importance of the $k$-th risk factor is measured by the Shapley value defined by the following formula

$$\phi_k(v) = \sum_{S \subseteq M \setminus \{k\}} \frac{|S|!\,(|M| - |S| - 1)!}{|M|!} \left[ v(S \cup \{k\}) - v(S) \right]$$

where $S$ denotes a subset of risk factors that does not include the risk factor for which we are calculating the value $\phi_k(v)$; $S \cup \{k\}$ is the subset consisting of the risk factors in $S$ together with the $k$-th risk factor; $v(S)$ corresponds to the outcome of the ML-based model explained using the risk factors in $S$; and $S \subseteq M \setminus \{k\}$ ranges over all subsets $S$ of the full set $M$ of risk factors, excluding the $k$-th risk factor.
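For a handful of features, Shapley values can be computed exactly by enumerating every coalition, which makes the weighting of marginal contributions explicit. The sketch below does this for a hypothetical three-feature value function; it is not the SHAP library's algorithm (which relies on model-specific approximations), and the feature names and contribution values are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values by enumerating every coalition S that excludes player k."""
    m = len(players)
    result = {}
    for k in players:
        others = [p for p in players if p != k]
        total = 0.0
        for size in range(m):
            # Weight |S|! (|M| - |S| - 1)! / |M|! for coalitions of this size.
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value_fn(s | {k}) - value_fn(s))
        result[k] = total
    return result

# Hypothetical 'model': additive contributions plus an age-BMI interaction.
def toy_value(coalition):
    base = {"age": 0.30, "bmi": 0.10, "salt": 0.05}
    value = sum(base[f] for f in coalition)
    if "age" in coalition and "bmi" in coalition:
        value += 0.08  # interaction, shared equally between the two features
    return value

phi = shapley_values(["age", "bmi", "salt"], toy_value)
print({name: round(v, 4) for name, v in phi.items()})
```

The values sum to the difference between the full-coalition outcome and the empty-coalition outcome, which is the efficiency property mentioned above. Exact enumeration grows exponentially with the number of features, so it is practical only for small examples like this one.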

Risk factors selection using Boruta
The result of the Boruta-based feature selection method is presented in Fig 2. The method showed that age, occupation, PA, walking, diabetes, height, weight, BMI, smoking, drinking, vegetable, fat, transport, HD, wealth, and HHTN were the important risk factors of HTN. The selected risk factors were used to construct the ML-based models for predicting HTN status (HTN or non-HTN).

Performance comparisons of ML-based models
The performance of the four ML-based models with under-sampling and ADASYN is shown in Table 6 and S1 Fig. It Fig 3. The ROC curves and precision-recall curves also indicated that the XGB model performed significantly better than the other models (LR, ANN, and RF). Therefore, in comparison with the other models, our results showed that the XGB-based model with ADASYN performed well.

Interpretable risk factors of hypertension
SHAP analysis was executed to determine the interpretable predictive risk factor of HTN for the suited prediction model (XGB) based on the SHAP values. Fig 4(A) explains the global

Discussion
In this study, we investigated several ML-based algorithms to propose an explainable framework for predicting the risk of HTN in Ethiopia. We trained four ML algorithms (LR, ANN, RF, and XGB) to predict HTN, using 16 risk factors obtained from the Boruta feature selection method. The performance of the developed models was compared by accuracy, precision, recall, F1-score, and ROC curve with AUC value on the testing set. Based on these performance measurements, we proposed the XGB model as the most appropriate candidate classifier for predicting HTN. Several studies have used ML frameworks to predict the risk of HTN. A comparison of the present study with the existing studies is presented in Table 7. Chowdhury et al. [54] proposed a system based on 18,322 respondents with 24 candidate risk factors in Canada. Before constructing the models, they applied five top FSMs to select the significant risk factors and adopted five algorithms for predicting HTN: LASSO, Elastic Net, random survival forest (RSF), and gradient boosting, along with the conventional Cox proportional hazard model. They measured the performance of each model by the C-index. Pratiwi OA [35] applied four ML algorithms (DT, RF, GB, and LR) for predicting the individual risk of HTN in Indonesia. The models were developed with the K10 protocol on the training set, and their prediction performance was measured on the testing set in terms of accuracy, precision, recall, F1-score, and AUC. LR was reported as marginally the best performer, with an AUC of 0.829. Oanh and Tung [55] suggested an ML-based model to predict patients at risk of HTN in Vietnam. The model was developed on the training set using Naïve Bayes (NB), multilayer perceptron (MLP), decision tree (DT), k-nearest neighbors (kNN), SVM, and ensemble algorithms: bagging (RF), boosting, and voting. The performance of the models was assessed on the testing set in terms of F1-score, precision, and recall. Islam et al. 
[38] conducted a study in three countries: Bangladesh, Nepal, and India. They included 818,603 respondents with seven risk factors and applied DT, RF, GBM, XGB, LR, and LDA algorithms for predicting HTN patients. They reported that XGB achieved the best performance. Chai et al. [56] used Malaysian data with 2461 respondents and 11 covariates to develop a system for diagnosing HTN patients using three types of algorithms: neural network (MLP), classical models (LR, DT, NB, k-NN), and ensemble models (RF, SVM, GB, XGB, LightGBM, CatBoost, AdaBoost, and LogitBoost). Before building the models, they adopted a correlation-based FSM to select a set of leading features and used the SMOTE technique to balance the class labels of the data. They evaluated the predictive ability of the models by sensitivity, specificity, accuracy, precision, F1-score, misclassification rate, and AUC on the testing set and found that the LightGBM-based model achieved the best accuracy, 74.39%. Islam et al. [57] used nationally representative HTN data from Bangladesh. The data consisted of 6965 subjects with 13 risk factors. They determined the prominent risk factors of HTN using two popular FSMs, LASSO and SVMRFE. They then used the K10 protocol to construct models with four ML algorithms on the training set and measured the performance of the models on the testing set using accuracy, precision, recall, F1-score, and AUC. Their overall experimental settings demonstrated that the gradient boosting model attained the best AUC (0.669). Zheng et al. [58] explored a system for predicting HTN patients using several ML techniques in the USA. No feature selection method was used to select the prominent features of HTN before constructing the ML-based system. They found that the ANN model reached the maximum performance score. Alkaabi et al. [59] used HTN data from Qatar. The dataset comprised 987 respondents with 12 risk factors. 
They adopted three ML-based algorithms: DT, RF, and LR. Their overall experimental results indicated that the RF model provided better generalization and predictive ability than the others. Thus, the comparative results suggested that our proposed XGB framework can predict HTN with a higher AUC (Table 7). Moreover, SHAP analysis with the proposed method revealed that age, weight, fat, income, diabetes, BMI, height, salt, smoking, and HHTN were the associated risk factors for developing HTN. The local explanation summary plot showed that age is the first leading risk factor of HTN in Ethiopia. A study conducted by Belay et al. (2022) in Ethiopia found that patients aged >60 years were two times more likely to have HTN than those aged 18-40 years [11]. This result is also supported by several systematic review and meta-analysis studies [60,61]. With older age, the vascular system changes, particularly through stiffening of the large arteries. Weight and fat are the second and third leading drivers of HTN. This finding supports the conclusions of earlier investigations [62]. Excess body weight increases visceral and retroperitoneal fat, which can contribute to the development of HTN. Household income is linked to the risk of HTN, which is in line with prior investigations [63]. For a number of reasons, including the ongoing nutritional transition, rising trends in sedentary lifestyle, and other modifiable risk factors, people from low-income families may bear a greater burden of the disease [64]. BMI is another determinant of HTN, which is corroborated by earlier studies [65]. A high BMI might contribute to HTN and other cardiovascular diseases by stimulating the renin-aldosterone system and causing endothelial dysfunction [66]. Diabetes is another important marker of HTN. Diabetes and HTN may cause each other and share common risk factors. HHTN is another important covariate of HTN. 
This result also coincides with previous studies conducted in Ethiopia and other countries [67]. This might be because family members share the same genetic factors, behaviors, largely similar lifestyles, and environmental factors that could influence the risk of HTN. Additionally, other risk factors such as salt, alcohol drinking, and smoking were found to be important contributing risk factors of HTN, which is similar to other studies in the literature [68,69]. Although this work has many strengths, it also has some limitations. The sample included only permanent residents of the city administration who had lived in the area for more than six months and were older than 30. Additionally, the study did not measure the amounts of alcohol, cigarettes, fruits, vegetables, fats, and salt consumed in measurable units.

Conclusions
In this study, we adopted four different machine learning algorithms to build the most appropriate predictive model for the classification of HTN. The overall experimental results indicated that, among the four models, the XGB model is the most appropriate for predicting patients at risk of HTN. The SHAP analysis revealed that age, weight, fat, income, BMI, diabetes, salt, HHTN, drinking, and smoking are the highest contributing risk factors for developing HTN. Therefore, the proposed integrated system can be conveniently used as a tool in clinical settings to accurately identify patients at risk of HTN at an early stage. With the help of this information, a doctor can make decisions that reduce healthcare costs and time while also enabling individualized interventions and targeted treatment to minimize the burden of HTN in Ethiopia.